During the COVID pandemic, the Centers for Disease Control and Prevention (CDC) published an article showing that people with obesity were more likely to die or develop further complications from the virus. Many similar studies indicate how severe a health problem obesity is.
The United States is known for being a land of great opportunity, wealth, and freedom. Along with that, however, it is associated with a sedentary lifestyle, fast-food restaurants, and obesity. This stereotype has some basis: the United States has an obesity rate of 36.5%, making the country the most obese among developed countries. Within the limits of the conclusions that data science can support, we decided to answer: Is the availability of fast-food restaurants in the USA associated with higher obesity rates? We also attempt to identify other variables that could be associated with obesity rates and determine how strongly they appear to be correlated with them.
We used data from the Food Environment Atlas by the U.S. Department of Agriculture (USDA), collected in years ranging from 2013 to 2015. The Atlas draws on the Behavioral Risk Factor Surveillance System (BRFSS), the U.S. Census, and the USDA’s Economic Research Service as its sources and organizes its data at the county level, so we used each county as one data point. Because the variables come from different years, the results of our model may be slightly affected. We expect this effect to be negligible, since the measurements are at most two years apart.
We picked a few variables that we thought could be relevant in determining obesity rates:
obesrate = rate of obesity in each county in the US, 2013
fasfoo = fast-food restaurants per 1000 people in each county in the US, 2014
medinc = median household income, 2015
diab = rate of diabetes in each county in the US, 2013
fitplace = recreation & fitness facilities per 1000 people in each county in the US, 2014
loaccgro = percent of access to grocery stores in each county in the US, 2015
To answer our question, we chose a two-step process. In the first step, we split the initial data set into clusters using k-means, with the aim of letting the models in the second step make better predictions within more homogeneous groups. Fitting separate models on separate subsets may also help limit over-fitting. Clustering may also offer additional insights into our data; for example, we can use visualizations to identify patterns by location.
We clustered on the variables mentioned above (fasfoo, medinc, diab, fitplace, loaccgro), excluding obesity rate as this is our target variable.
Based on the elbow chart, 4 centers looked best for k-means. We created one data set for each of the 4 clusters, leading us to the second step of our two-step process. For each data set, we trained a random forest (RF) regression model and evaluated it using mean squared error (MSE). RF also lets us inspect variable importance, helping us answer our question of which features are the best predictors of obesity rate. RF’s use of bagging (bootstrap aggregation) together with random feature selection at each split, which helps mitigate over-fitting, also made it an appealing option.
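As an illustration of how an elbow chart is built, here is a minimal Python sketch that runs a plain k-means (Lloyd's algorithm) for several values of k and records the within-cluster sum of squares. It is not the code used for this report: the synthetic points, the function names (`kmeans`, `within_cluster_ss`, `make_toy_counties`), and the seeds are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k sampled data points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New center = coordinate-wise mean of the cluster members.
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

def within_cluster_ss(points, centers, labels):
    """Total squared distance from each point to its assigned center."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))

def make_toy_counties(n_per_group=30, seed=1):
    """Hypothetical stand-in data: 4 groups of 5-feature points scaled to [0, 1]."""
    rng = random.Random(seed)
    points = []
    for group_mean in (0.2, 0.4, 0.6, 0.8):
        for _ in range(n_per_group):
            points.append(tuple(min(1.0, max(0.0, rng.gauss(group_mean, 0.03)))
                                for _ in range(5)))
    return points

points = make_toy_counties()
wcss_by_k = {}
for k in range(1, 7):
    centers, labels = kmeans(points, k)
    wcss_by_k[k] = within_cluster_ss(points, centers, labels)
    print(k, round(wcss_by_k[k], 4))
```

Plotting `wcss_by_k` against k gives the elbow chart; the "elbow" is the value of k after which adding more centers reduces the sum of squares only slightly.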
summary(county_data)
##     state              county             loaccgro          fasfoo
##  Length:3120        Length:3120        Min.   :0.0000   Min.   :0.00000
##  Class :character   Class :character   1st Qu.:0.1094   1st Qu.:0.03520
##  Mode  :character   Mode  :character   Median :0.1920   Median :0.04875
##                                        Mean   :0.2307   Mean   :0.05627
##                                        3rd Qu.:0.2886   3rd Qu.:0.06490
##                                        Max.   :1.0000   Max.   :1.00000
##       diab           obesrate         fitplace           medinc
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000
##  1st Qu.:0.3069   1st Qu.:0.4609   1st Qu.:0.00000   1st Qu.:0.1704
##  Median :0.3861   Median :0.5419   Median :0.07431   Median :0.2321
##  Mean   :0.3926   Mean   :0.5367   Mean   :0.08391   Mean   :0.2494
##  3rd Qu.:0.4752   3rd Qu.:0.6145   3rd Qu.:0.12934   3rd Qu.:0.3037
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000
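Every numeric column in the summary runs from exactly 0 to exactly 1, which suggests the variables were min-max scaled before clustering. That is an inference from the output, not something the text states. A minimal Python sketch of that kind of scaling, with a hypothetical function name and made-up income values, could look like this:

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up median incomes, for illustration only (not values from the Atlas).
raw_income = [30000.0, 45000.0, 60000.0, 120000.0]
scaled_income = min_max_scale(raw_income)
print(scaled_income)
```

Scaling matters for k-means because the algorithm relies on Euclidean distance: without it, a variable measured in dollars would dominate variables measured as rates.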
Since we use a two-step data analysis process, first clustering the data and then running separate models in each cluster, we thought this would be a good opportunity to explore these clusters a little more.
We clustered in 5 dimensions, one for each of the explanatory variables. Because of that, showing the clusters themselves would require several plots, with little to take away from them. We believed that a bar plot of the average of each variable in each cluster would be more relevant and more informative, in a simpler form.
Cluster 1:
Clustered the wealthier counties, with lower diabetes rates and the highest proportions of access to fitness establishments.
Cluster 2:
Clustered counties with high access to groceries and the lowest number of fitness establishments, mainly.
Cluster 3:
Clustered counties with the lowest incomes, the highest rates of diabetes, and the lowest rate of access to groceries.
Cluster 4:
Seems to be the most balanced of all clusters.
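The per-cluster averages behind a bar plot like the one described above can be computed in a few lines. The following Python sketch assumes each county is a dictionary keyed by the five feature names and that a list of cluster labels is already available; the function name and the three toy rows are hypothetical.

```python
from collections import defaultdict

FEATURES = ["fasfoo", "medinc", "diab", "fitplace", "loaccgro"]

def cluster_feature_means(rows, labels):
    """Average of every feature within each cluster label."""
    sums = defaultdict(lambda: [0.0] * len(FEATURES))
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        counts[label] += 1
        for i, name in enumerate(FEATURES):
            sums[label][i] += row[name]
    return {
        label: {name: sums[label][i] / counts[label] for i, name in enumerate(FEATURES)}
        for label in counts
    }

# Toy, already-scaled rows for illustration only.
rows = [
    {"fasfoo": 0.05, "medinc": 0.40, "diab": 0.30, "fitplace": 0.12, "loaccgro": 0.20},
    {"fasfoo": 0.07, "medinc": 0.60, "diab": 0.20, "fitplace": 0.18, "loaccgro": 0.10},
    {"fasfoo": 0.04, "medinc": 0.15, "diab": 0.50, "fitplace": 0.02, "loaccgro": 0.30},
]
labels = [1, 1, 3]
means = cluster_feature_means(rows, labels)
for label in sorted(means):
    print(label, means[label])
```

Each inner dictionary corresponds to one group of bars: one bar per feature for that cluster.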
Next, we wanted to see how these clusters are distributed geographically across the country. We calculated the percentage of counties per state that fall in each cluster and plotted that information.
Cluster 1:
Mostly New England and California. This is the wealthier cluster, so this makes sense.
Cluster 2:
There are extremely few data points in this cluster. It seems that most of the places where a high percentage of the population has access to groceries are in the southwest and the northwest of the country.
Cluster 3:
Concentrates many of the southern states, aside from Florida and Texas.
Cluster 4:
Seems to be almost evenly distributed across the country, which makes sense given the bar plot.
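The state-level percentages described above amount to a grouped count. A minimal Python sketch, assuming one state code and one cluster label per county (the function name and the six toy counties are hypothetical):

```python
from collections import Counter, defaultdict

def cluster_share_by_state(states, labels):
    """Percentage of each state's counties that fall in each cluster."""
    counts = defaultdict(Counter)
    for state, label in zip(states, labels):
        counts[state][label] += 1
    shares = {}
    for state, per_cluster in counts.items():
        total = sum(per_cluster.values())
        shares[state] = {label: 100.0 * n / total for label, n in per_cluster.items()}
    return shares

# Toy example: four counties in one state, two in another.
states = ["CA", "CA", "CA", "CA", "TX", "TX"]
labels = [1, 1, 1, 4, 3, 4]
shares = cluster_share_by_state(states, labels)
for state in sorted(shares):
    print(state, shares[state])
```

A cluster that is absent from a state simply has no entry for it, which a plotting step would treat as 0%.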
##               %IncMSE IncNodePurity
## loaccgro 0.0002616216     0.1420106
## fasfoo   0.0019643544     0.2606380
## diab     0.0055175303     0.4533984
## fitplace 0.0006342932     0.1598106
## medinc   0.0012446579     0.2191012
The diabetes rate was the most important variable for predicting the obesity rate, which makes sense since the two would appear to be highly correlated at first glance.
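The %IncMSE column is a permutation-style importance: it reflects how much the prediction error grows when one variable's values are shuffled. The Python sketch below illustrates that idea in a simplified form. It is not the random forest procedure itself, which works on out-of-bag samples tree by tree, and the stand-in predictor, the function names, and the toy data are all assumptions.

```python
import random

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def permutation_importance(predict, rows, target, feature, seed=0):
    """Increase in MSE after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    baseline = mse(target, [predict(r) for r in rows])
    shuffled_values = [r[feature] for r in rows]
    rng.shuffle(shuffled_values)
    # Copy each row, replacing only the shuffled feature.
    permuted_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled_values)]
    return mse(target, [predict(r) for r in permuted_rows]) - baseline

# Hypothetical stand-in model that only looks at the diabetes rate.
def toy_predict(row):
    return 0.2 + 0.8 * row["diab"]

rows = [{"diab": i / 20.0, "fitplace": (i * 7 % 20) / 20.0} for i in range(20)]
target = [toy_predict(r) for r in rows]  # the toy model fits its own data exactly

diab_importance = permutation_importance(toy_predict, rows, target, "diab")
fitplace_importance = permutation_importance(toy_predict, rows, target, "fitplace")
print("diab:", diab_importance)
print("fitplace:", fitplace_importance)
```

Shuffling a variable the model relies on raises the error, while shuffling one it ignores leaves the error unchanged, which is why a near-zero %IncMSE is read as low importance.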
Random Forest MSE:
## [1] 0.006136722
MSE using obesity rate’s mean as prediction:
## [1] 0.01350612
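The comparison between the two numbers above can be made concrete: the baseline predicts the training-set mean for every test county, and its MSE is the bar the model has to beat. The Python sketch below shows that computation on made-up obesity rates; the function names and values are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_baseline_mse(train_target, test_target):
    """MSE obtained by predicting the training mean for every test observation."""
    train_mean = sum(train_target) / len(train_target)
    return mean_squared_error(test_target, [train_mean] * len(test_target))

# Made-up scaled obesity rates, for illustration only.
train_obesrate = [0.5, 0.6, 0.4]
test_obesrate = [0.5, 0.7]
baseline = mean_baseline_mse(train_obesrate, test_obesrate)
print(baseline)
```

A model is only useful if its test MSE falls below this baseline, since the baseline uses no explanatory variables at all.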
Cluster 2 has very few data points. Because of this, I will use 85% of the data for training and tuning, and the rest for testing.
This was a very interesting cluster, as it concentrates counties where a very high share of the population has access to groceries and counties with low access to gyms. Initially I believed these features wouldn’t be negatively correlated, but it seems they are, at least at the national level.
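An 85/15 split of this kind can be sketched in Python as a seeded shuffle followed by a cut. The function name, the seed, and the 40 placeholder rows are assumptions for the example.

```python
import random

def train_test_split(rows, train_fraction=0.85, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# 40 placeholder county indices standing in for a small cluster.
county_ids = list(range(40))
train_ids, test_ids = train_test_split(county_ids)
print(len(train_ids), len(test_ids))
```

With so few counties the test set is tiny, so the test MSE will vary noticeably from one random split to another.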
##           %IncMSE IncNodePurity
## loaccgro 0.000569    0.14883766
## fasfoo   0.001219    0.21474863
## diab     0.002749    0.37094615
## fitplace 0.000086    0.01696633
## medinc   0.001287    0.25967360
Random Forest MSE:
## [1] 0.01233427
MSE using obesity rate’s mean as prediction:
## [1] 0.01798417
The mean squared error of the sample was really low. This is to be expected, since we clustered the data into four groups of counties that are similar in the factors that determine obesity rate.
If, instead of training a model, we simply predicted the average obesity rate of the training data, the MSE would be 0.01798417. Training a random forest on the data gives an MSE of 0.01233427, which is about 30% lower.
Given that there are so few data points, especially for the testing portion, I believe that this is a good result.
Unsurprisingly, the rate of people with diabetes in the county was the most important predictor. Next, the number of fast-food restaurants in the county and median income were essentially tied as the second most relevant variables, and access to grocery stores came in fourth. Access to gyms had almost no relevance at all.
##               %IncMSE IncNodePurity
## loaccgro 0.0004148564    0.14368152
## fasfoo   0.0008994852    0.17054091
## diab     0.0041996768    0.32150511
## fitplace 0.0001376759    0.06746348
## medinc   0.0007539823    0.16368026
Random Forest MSE:
## [1] 0.007061758
MSE using obesity rate’s mean as prediction:
## [1] 0.01176137
Once again, the random forest model performed better than simply predicting the mean obesity rate, on both the test and training sets.
##               %IncMSE IncNodePurity
## loaccgro 1.104834e-04    0.13156826
## fasfoo   1.219760e-03    0.22510925
## diab     5.012530e-03    0.38400445
## fitplace 3.298028e-05    0.08183718
## medinc   7.854047e-04    0.22025044
Random Forest MSE:
## [1] 0.008394561
MSE using obesity rate’s mean as prediction:
## [1] 0.01205747
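To compare the four models on a common scale, the reported MSE values can be turned into relative reductions over the mean baseline. The Python sketch below uses the numbers printed in the output blocks of this report and assumes those blocks appear in cluster order 1 to 4, which the text does not label explicitly; the dictionary and function names are hypothetical.

```python
# (random forest MSE, mean-baseline MSE) as printed in this report.
# Assumption: the four output blocks appear in cluster order 1 to 4.
reported_mse = {
    1: (0.006136722, 0.01350612),
    2: (0.01233427, 0.01798417),
    3: (0.007061758, 0.01176137),
    4: (0.008394561, 0.01205747),
}

def relative_reduction(model_mse, baseline_mse):
    """Fraction by which the model's MSE undercuts the baseline's MSE."""
    return 1.0 - model_mse / baseline_mse

reductions = {
    cluster: relative_reduction(model, baseline)
    for cluster, (model, baseline) in reported_mse.items()
}
for cluster in sorted(reductions):
    print(f"cluster {cluster}: {100 * reductions[cluster]:.1f}% lower than the baseline")
```

Under that ordering assumption, the reductions come out at roughly 55%, 31%, 40%, and 30% for clusters 1 through 4, so every model beats its baseline, with the largest gain in the first cluster.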
On the fairness side of things, there isn’t too much to look into. When a model’s fairness is to be evaluated, the most important thing to examine is how it treats protected classes. These protected classes could include race, gender, and so on. These could also include proxies for race, gender, etc. such as family statistics, education, income, and so on. However, our model only includes median income as one feature, and that itself wasn’t rated very highly in the importance metrics for three out of our four clusters. Therefore, fairness should not be too much of an issue in our model.
Based on each of our random forest models, the rate of diabetes in a county was clearly the best predictor of obesity rate, as expected. After diabetes, the number of fast-food restaurants per 1000 people in each county was the next best predictor, ranking 2nd in variable importance for every cluster except the second, where it was essentially tied for 2nd place with median household income.
Meanwhile, for every cluster except cluster 1, the number of recreation/fitness centers per 1000 people was the least useful in predicting obesity rate, while the percent of access to grocery stores ranked last in cluster 1.
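The rankings summarized above can be reproduced directly from the %IncMSE columns printed earlier. The Python sketch below copies those values and sorts them; it assumes the four importance tables appear in cluster order 1 to 4, and the names `inc_mse` and `rank_features` are hypothetical.

```python
# %IncMSE values as printed in this report.
# Assumption: the four importance tables appear in cluster order 1 to 4.
inc_mse = {
    1: {"loaccgro": 0.0002616216, "fasfoo": 0.0019643544, "diab": 0.0055175303,
        "fitplace": 0.0006342932, "medinc": 0.0012446579},
    2: {"loaccgro": 0.000569, "fasfoo": 0.001219, "diab": 0.002749,
        "fitplace": 0.000086, "medinc": 0.001287},
    3: {"loaccgro": 0.0004148564, "fasfoo": 0.0008994852, "diab": 0.0041996768,
        "fitplace": 0.0001376759, "medinc": 0.0007539823},
    4: {"loaccgro": 1.104834e-04, "fasfoo": 1.219760e-03, "diab": 5.012530e-03,
        "fitplace": 3.298028e-05, "medinc": 7.854047e-04},
}

def rank_features(scores):
    """Feature names ordered from most to least important."""
    return sorted(scores, key=scores.get, reverse=True)

rankings = {cluster: rank_features(scores) for cluster, scores in inc_mse.items()}
for cluster in sorted(rankings):
    print(cluster, rankings[cluster])
```

In cluster 2 the gap between `medinc` and `fasfoo` is small enough that the two are better read as tied than as strictly ordered.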
In conclusion, given the correlation shown by our model, it looks like factors that may contribute to someone becoming obese (e.g. fast-food restaurants) are likely to be a better indicator of obesity rate than factors that would help prevent someone from becoming obese (e.g. fitness centers).
It would have been beneficial to look into additional factors, considering we only had 5 features. Our features were also on the more obvious side for predicting obesity, and the outcomes weren’t all that surprising. Variables such as population demographics (e.g. age, ethnicity), population density, screen time (e.g. iPhone screen time), number of registered vehicles per household, and public transportation budget could be interesting to include.
Furthermore, we could split our initial data set in different ways. For instance, we could use more or fewer clusters. We could also intentionally group certain counties together to form specific subgroups, for example by state, region, or time zone.